require(Hmisc)
knitrSet(lang='markdown', h=4.5)
options(grType='plotly')
mu <- markupSpecs$html
Install the packages Hmisc, rms, knitr, rmarkdown through RStudio.
Packages contain R code and may often require the installation of other packages. When I install Hmisc, I see that its dependency acepack is automatically installed:
Installing package into ‘/home/beckca/R/x86_64-pc-linux-gnu-library/3.1’
(as ‘lib’ is unspecified)
also installing the dependency ‘acepack’
trying URL 'http://debian.mc.vanderbilt.edu/R/CRAN/src/contrib/acepack_1.3-3.3.tar.gz'
Content type 'application/x-gzip' length 33590 bytes (32 Kb)
opened URL
==================================================
downloaded 32 Kb
trying URL 'http://debian.mc.vanderbilt.edu/R/CRAN/src/contrib/Hmisc_3.14-6.tar.gz'
Content type 'application/x-gzip' length 611348 bytes (597 Kb)
opened URL
==================================================
downloaded 597 Kb
* installing *source* package ‘acepack’ ...
.
.
.
* DONE (acepack)
* installing *source* package ‘Hmisc’ ...
.
.
.
* DONE (Hmisc)
Install through CRAN repositories
# install several packages
install.packages(c('Hmisc', 'rms', 'knitr', 'rmarkdown'))
# update all packages
update.packages(checkBuilt=TRUE, ask=FALSE)
Install through Github repositories
install.packages('devtools')
require(devtools)
install_github('harrelfe/rms')
Note that library is a bit of misnomer as R uses packages, not libraries. From a technical standpoint, it’s nice to recognize the distinction. You may see require used in its place. If the package is not installed, library will trigger an error while require will return FALSE. Once loaded the functions created in the package are available to your R session.
library(Hmisc)
require(Hmisc)
The getHdata function is used to fetch a dataset from the Vanderbilt DataSets web site. upData is used to
units attributed used by Hmisc and rms functions for table making and graphicscontents is used to print a data dictionary, run through an html method for nicer output.
getHdata(pbc)
pbc <- upData(pbc,
fu.yrs = fu.days / 365.25,
labels = c(fu.yrs = 'Follow-up Time',
status = 'Death or Liver Transplantation'),
units = c(fu.yrs = 'year'),
drop = 'fu.days',
moveUnits=TRUE, html=TRUE)
Input object size: 76592 bytes; 19 variables 418 observations
Label for bili changed from Serum Bilirubin (mg/dl) to Serum Bilirubin
units set to mg/dl
Label for albumin changed from Albumin (gm/dl) to Albumin
units set to gm/dl
Label for protime changed from Prothrombin Time (sec.) to Prothrombin Time
units set to sec.
Label for alk.phos changed from Alkaline Phosphatase (U/liter) to Alkaline Phosphatase
units set to U/liter
Label for sgot changed from SGOT (U/ml) to SGOT
units set to U/ml
Label for chol changed from Cholesterol (mg/dl) to Cholesterol
units set to mg/dl
Label for trig changed from Triglycerides (mg/dl) to Triglycerides
units set to mg/dl
Label for platelet changed from Platelets (per cm^3/1000) to Platelets
units set to per cm^3/1000
Label for copper changed from Urine Copper (ug/day) to Urine Copper
units set to ug/day
Added variable fu.yrs
Dropped variable fu.days
New object size: 80224 bytes; 19 variables 418 observations
html(contents(pbc), maxlevels=10, levelType='table')
| Name | Labels | Units | Levels | Storage | NAs |
|---|---|---|---|---|---|
| bili | Serum Bilirubin | mg/dl | double | 0 | |
| albumin | Albumin | gm/dl | double | 0 | |
| stage | Histologic Stage, Ludwig Criteria | integer | 6 | ||
| protime | Prothrombin Time | sec. | double | 2 | |
| sex | 2 | integer | 0 | ||
| age | Age | double | 0 | ||
| spiders | 2 | integer | 106 | ||
| hepatom | 2 | integer | 106 | ||
| ascites | 2 | integer | 106 | ||
| alk.phos | Alkaline Phosphatase | U/liter | double | 106 | |
| sgot | SGOT | U/ml | double | 106 | |
| chol | Cholesterol | mg/dl | integer | 134 | |
| trig | Triglycerides | mg/dl | integer | 136 | |
| platelet | Platelets | per cm^3/1000 | integer | 110 | |
| drug | 3 | integer | 0 | ||
| status | Death or Liver Transplantation | integer | 0 | ||
| edema | 3 | integer | 0 | ||
| copper | Urine Copper | ug/day | integer | 108 | |
| fu.yrs | Follow-up Time | year | double | 0 |
| Variable | Levels |
|---|---|
| sex | male |
| female | |
| spiders, hepatom | absent |
| ascites | present |
| drug | D-penicillamine |
| placebo | |
| not randomized | |
| edema | no edema |
| edema, no diuretic therapy | |
| edema despite diuretic therapy |
# did have results='asis' above
d <- describe(pbc)
html(d, size=80, scroll=TRUE)
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 418 | 0 | 98 | 0.998 | 3.221 | 3.742 | 0.50 | 0.60 | 0.80 | 1.40 | 3.40 | 8.03 | 14.00 |
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 418 | 0 | 154 | 1 | 3.497 | 0.473 | 2.750 | 2.967 | 3.243 | 3.530 | 3.770 | 4.010 | 4.141 |
| n | missing | distinct | Info | Mean | Gmd |
|---|---|---|---|---|---|
| 412 | 6 | 4 | 0.893 | 3.024 | 0.9519 |
Value 1 2 3 4 Frequency 21 92 155 144 Proportion 0.051 0.223 0.376 0.350
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 416 | 2 | 48 | 0.998 | 10.73 | 1.029 | 9.60 | 9.80 | 10.00 | 10.60 | 11.10 | 12.00 | 12.45 |
| n | missing | distinct |
|---|---|---|
| 418 | 0 | 2 |
Value male female Frequency 44 374 Proportion 0.105 0.895
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 418 | 0 | 345 | 1 | 50.74 | 11.96 | 33.84 | 36.37 | 42.83 | 51.00 | 58.24 | 64.30 | 67.92 |
| n | missing | distinct |
|---|---|---|
| 312 | 106 | 2 |
Value absent present Frequency 222 90 Proportion 0.712 0.288
| n | missing | distinct |
|---|---|---|
| 312 | 106 | 2 |
Value absent present Frequency 152 160 Proportion 0.487 0.513
| n | missing | distinct |
|---|---|---|
| 312 | 106 | 2 |
Value absent present Frequency 288 24 Proportion 0.923 0.077
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 312 | 106 | 295 | 1 | 1983 | 1760 | 599.6 | 663.0 | 871.5 | 1259.0 | 1980.0 | 3826.4 | 6669.9 |
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 312 | 106 | 179 | 1 | 122.6 | 60.45 | 54.25 | 60.45 | 80.60 | 114.70 | 151.90 | 196.47 | 219.25 |
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 284 | 134 | 201 | 1 | 369.5 | 194.5 | 188.4 | 213.6 | 249.5 | 309.5 | 400.0 | 560.8 | 674.0 |
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 282 | 136 | 146 | 1 | 124.7 | 64.07 | 56.00 | 63.10 | 84.25 | 108.00 | 151.00 | 195.00 | 230.95 |
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 308 | 110 | 210 | 1 | 261.9 | 107.8 | 117.7 | 139.7 | 199.8 | 257.0 | 322.5 | 386.5 | 430.6 |
| n | missing | distinct |
|---|---|---|
| 418 | 0 | 3 |
Value D-penicillamine placebo not randomized Frequency 154 158 106 Proportion 0.368 0.378 0.254
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 418 | 0 | 2 | 0.71 | 161 | 0.3852 | 0.4748 |
| n | missing | distinct |
|---|---|---|
| 418 | 0 | 3 |
Value no edema edema, no diuretic therapy
Frequency 354 44
Proportion 0.847 0.105
Value edema despite diuretic therapy
Frequency 20
Proportion 0.048
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 310 | 108 | 158 | 1 | 97.65 | 83.16 | 17.45 | 24.00 | 41.25 | 73.00 | 123.00 | 208.10 | 249.20 |
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 418 | 0 | 399 | 1 | 5.251 | 3.429 | 0.671 | 1.661 | 2.992 | 4.736 | 7.155 | 9.649 | 11.063 |
| lowest : | 0.1122519 | 0.1177276 | 0.1396304 | 0.1943874 | 0.2108145 |
| highest: | 12.3203285 | 12.3449692 | 12.3832991 | 12.4736482 | 13.1279945 |
plot(d)
$Categorical
$Continuous NAProduce stratified quantiles, means/SD, and proportions by treatment group. Plot the results before rendering as an advanced html table:
s <- summaryM(bili + albumin + stage + protime + sex + age + spiders +
alk.phos + sgot + chol ~ drug, data=pbc,
overall=FALSE, test=TRUE)
plot(s, which='categorical')
Ignoring 1 observationsIgnoring 1 observations
plot(s, which='continuous', vars=1:4)
Ignoring 3 observationsIgnoring 3 observationsIgnoring 3 observationsIgnoring 3 observations
plot(s, which='continuous', vars=5:7)
Ignoring 2 observationsIgnoring 2 observationsIgnoring 2 observations
html(s, caption='Baseline characteristics by randomized treatment',
exclude1=TRUE, npct='both', digits=3,
prmsd=TRUE, brmsd=TRUE, msdsize=mu$smaller2)
| Baseline characteristics by randomized treatment. | |||||
| N |
D-penicillamine N=154 |
placebo N=158 |
not randomized N=106 |
Test Statistic |
|
|---|---|---|---|---|---|
Serum Bilirubin mg/dl |
418 | 0.725 1.300 3.600 3.649 ± 5.282 |
0.800 1.400 3.200 2.873 ± 3.629 |
0.725 1.400 3.075 3.117 ± 4.043 |
F2 415=0.03, P=0.9721 |
Albumin gm/dl |
418 | 3.342 3.545 3.777 3.524 ± 0.396 |
3.212 3.565 3.830 3.516 ± 0.443 |
3.125 3.470 3.720 3.431 ± 0.435 |
F2 415=2.13, P=0.121 |
| Histologic Stage, Ludwig Criteria : 1 | 412 | 0.03 4⁄154 | 0.08 12⁄158 | 0.05 5⁄100 | χ26=5.33, P=0.5022 |
| 2 | 0.21 32⁄154 | 0.22 35⁄158 | 0.25 25⁄100 | ||
| 3 | 0.42 64⁄154 | 0.35 56⁄158 | 0.35 35⁄100 | ||
| 4 | 0.35 54⁄154 | 0.35 55⁄158 | 0.35 35⁄100 | ||
Prothrombin Time sec. |
416 | 10.000 10.600 11.400 10.800 ± 1.138 |
10.025 10.600 11.000 10.653 ± 0.851 |
10.100 10.600 11.000 10.750 ± 1.078 |
F2 413=0.23, P=0.7951 |
| sex : female | 418 | 0.90 139⁄154 | 0.87 137⁄158 | 0.92 98⁄106 | χ22=2.38, P=0.3042 |
| Age | 418 | 41.43 48.11 55.80 48.58 ± 9.96 |
42.98 51.93 58.90 51.42 ± 11.01 |
46.00 53.00 61.00 52.87 ± 9.78 |
F2 415=6.1, P=0.0021 |
| spiders | 312 | 0.29 45⁄154 | 0.28 45⁄158 | χ21=0.02, P=0.8852 | |
Alkaline Phosphatase U/liter |
312 | 922 1283 1950 1943 ± 2102 |
841 1214 2028 2021 ± 2183 |
F1 310=0.06, P=0.8123 | |
SGOT U/ml |
312 | 83.8 117.4 151.9 125.0 ± 58.9 |
76.7 111.6 151.5 120.2 ± 54.5 |
F1 310=0.55, P=0.463 | |
Cholesterol mg/dl |
284 | 254 304 377 374 ± 252 |
248 316 417 365 ± 210 |
F1 282=0.37, P=0.5453 | |
|
a b c represent the lower quartile a, the median b, and the upper quartile c for continuous variables. x ± s represents X ± 1 SD. N is the number of non-missing values. Tests used: 1Kruskal-Wallis test; 2Pearson test; 3Wilcoxon test . | |||||
p <- with(pbc, histboxp(x=sgot, group=drug, sd=TRUE))
p
The very low birthweight data set contains data on 671 infants born with a birth weight of under 1600 grams. We’ll plot gestational age by birthweight using three graphics systems: base graphics, ggplot, and plotly.
getHdata(vlbw)
# remove missing values
vlbw <- vlbw[complete.cases(vlbw[,c('sex','dead','gest','bwt')]),]
Build each element into your plot.
grps <- split(vlbw[,c('gest','bwt')], vlbw[,c('sex','dead')])
plot(c(22,40), c(400,1600), type='n', xlab='Gestational Age', ylab='Birth Weight (grams)', axes=FALSE)
axis(1, at=c(22,28,34,40), labels=c(22,28,34,40))
axis(2, at=seq(400,1600,by=400), labels=seq(400,1600,by=400))
points(grps[['female.0']], col='black', pch=1)
points(grps[['female.1']][,'gest'], grps[['female.1']][,'bwt'], col='black', pch=0)
points(jitter(grps[['male.0']][,'gest'], 2), grps[['male.0']][,'bwt'], col='gray', pch=4)
points(grps[['male.0']][,'bwt'] ~ jitter(grps[['male.0']][,'gest'], 2), col='gray', pch=3)
legend(x=38, y=1000, legend=c('F:0','F:1','M:0','M:1'), col=c('black','black','gray','gray'), pch=c(1,0,4,3))
Given a data set, choose the aesthetic mapping and geometry layer.
p <- ggplot(data=vlbw) + aes(x=gest, y=bwt, color=sex, shape=as.factor(dead)) + geom_point()
p
p <- p + geom_jitter() + scale_x_continuous() + scale_y_continuous()
p
Add interactive graphics, which is trivial when also using ggplot.
require(plotly)
Loading required package: plotly
Attaching package: ‘plotly’
The following object is masked from ‘package:Hmisc’:
subplot
The following object is masked from ‘package:ggplot2’:
last_plot
The following object is masked from ‘package:stats’:
filter
The following object is masked from ‘package:graphics’:
layout
ggplotly(p)
plot_ly(type="box") %>%
add_boxplot(y = grps[['female.0']][,'bwt'], name='F:0') %>%
add_boxplot(y = grps[['female.1']][,'bwt'], name='F:1') %>%
add_boxplot(y = grps[['male.0']][,'bwt'], name='M:0') %>%
add_boxplot(y = grps[['male.1']][,'bwt'], name='M:1')
getHdata(diabetes)
html(contents(diabetes), levelType='table')
| Name | Labels | Units | Levels | Storage | NAs |
|---|---|---|---|---|---|
| id | Subject ID | integer | 0 | ||
| chol | Total Cholesterol | integer | 1 | ||
| stab.glu | Stabilized Glucose | integer | 0 | ||
| hdl | High Density Lipoprotein | integer | 1 | ||
| ratio | Cholesterol/HDL Ratio | double | 1 | ||
| glyhb | Glycosolated Hemoglobin | double | 13 | ||
| location | 2 | integer | 0 | ||
| age | years | integer | 0 | ||
| gender | 2 | integer | 0 | ||
| height | inches | integer | 5 | ||
| weight | pounds | integer | 1 | ||
| frame | 3 | integer | 12 | ||
| bp.1s | First Systolic Blood Pressure | integer | 5 | ||
| bp.1d | First Diastolic Blood Pressure | integer | 5 | ||
| bp.2s | Second Systolic Blood Pressure | integer | 262 | ||
| bp.2d | Second Diastolic Blood Pressure | integer | 262 | ||
| waist | inches | integer | 2 | ||
| hip | inches | integer | 2 | ||
| time.ppn | Postprandial Time when Labs were Drawn | minutes | integer | 3 |
| Variable | Levels |
|---|---|
| location | Buckingham |
| Louisa | |
| gender | male |
| female | |
| frame | small |
| medium | |
| large |
The tilde is used to create a model formula, which consists of a left-hand side and right-hand side. Many R functions utilize formulas, such as plotting functions and model-fitting functions. The left-hand side consists of the response variable, while the right-hand side may contain several terms. You may see the following operators within a formula.
+ indicates to include both a and b as terms: indicates the interaction of a and bI indicates to use + in the arithmetic senseSee ?formula for more examples.
The most simple model-fitting function is lm, which is used to fit linear models. It’s primary argument is a formula. Using the diabetes data set, we can fit waist size by weight.
(m <- lm(waist ~ weight, data=diabetes))
Call:
lm(formula = waist ~ weight, data = diabetes)
Coefficients:
(Intercept) weight
16.5050 0.1205
This creates an lm object, and several functions can be used on model objects. The internal structure of a model object is a list - its elements may be accessed just like a list.
names(m)
[1] "coefficients" "residuals" "effects" "rank" "fitted.values" "assign" "qr"
[8] "df.residual" "na.action" "xlevels" "call" "terms" "model"
m$coefficients
(Intercept) weight
16.5050329 0.1204712
coef(m)
(Intercept) weight
16.5050329 0.1204712
head(fitted(m))
1 2 3 4 5 6
31.08205 42.76776 47.34567 30.84111 38.55127 39.39457
predict(m, data.frame(weight=c(150, 200, 250)))
1 2 3
34.57572 40.59928 46.62284
head(residuals(m))
1 2 3 4 5 6
-2.082053 3.232237 1.654330 2.158890 5.448731 -3.394568
vcov(m)
(Intercept) weight
(Intercept) 0.465908502 -0.0024927458
weight -0.002492746 0.0000140231
summary(m)
Call:
lm(formula = waist ~ weight, data = diabetes)
Residuals:
Min 1Q Median 3Q Max
-7.6835 -2.0850 -0.2021 1.7741 11.5330
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 16.505033 0.682575 24.18 <2e-16 ***
weight 0.120471 0.003745 32.17 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.02 on 398 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-squared: 0.7223, Adjusted R-squared: 0.7216
F-statistic: 1035 on 1 and 398 DF, p-value: < 2.2e-16
coef(summary(m))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 16.5050329 0.682574906 24.18054 3.970142e-80
weight 0.1204712 0.003744743 32.17077 9.082965e-113
anova(m)
Analysis of Variance Table
Response: waist
Df Sum Sq Mean Sq F value Pr(>F)
weight 1 9438.0 9438.0 1035 < 2.2e-16 ***
Residuals 398 3629.4 9.1
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Create a scatterplot of weight and waist size.
p <- qplot(weight, waist, data=diabetes)
p + geom_smooth(method="lm")
Remove missing values and fit with LOESS curve.
diab <- diabetes[complete.cases(diabetes[,c('waist','weight')]),]
p <- qplot(weight, waist, data=diab)
p + geom_smooth(method="loess")
The Dominican Republic Hypertension dataset contains data on gender, age, and systolic and diastolic blood pressure from several villages in the DR.
getHdata(DominicanHTN)
html(describe(DominicanHTN))
| n | missing | distinct |
|---|---|---|
| 381 | 0 | 8 |
Value Bare' Nuevo Batey Verde Carmona Cojobal Juan Sanchez
Frequency 32 41 64 59 53
Proportion 0.084 0.108 0.168 0.155 0.139
Value La Altagracia Los Gueneos San Antonio
Frequency 40 57 35
Proportion 0.105 0.150 0.092
| n | missing | distinct |
|---|---|---|
| 381 | 0 | 2 |
Value Female Male Frequency 258 123 Proportion 0.677 0.323
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 381 | 0 | 69 | 0.999 | 47.97 | 16.98 | 23 | 30 | 38 | 47 | 59 | 68 | 73 |
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 381 | 0 | 57 | 0.996 | 133 | 28.35 | 98 | 105 | 118 | 130 | 150 | 170 | 180 |
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 381 | 0 | 41 | 0.993 | 84.23 | 15.58 | 64 | 70 | 76 | 82 | 92 | 100 | 110 |
DominicanHTN[,'map'] <- (DominicanHTN[,'sbp'] + DominicanHTN[,'dbp'] * 2) / 3
DominicanHTN[,'male'] <- as.numeric(DominicanHTN[,'gender'] == 'Male')
DominicanHTN[1:5,]
qage <- quantile(DominicanHTN[,'age'])
qage[1] <- qage[1]-1
DominicanHTN[,'ageGrp'] <- cut(DominicanHTN[,'age'], breaks=qage)
nrow(DominicanHTN)
[1] 381
nrow(DominicanHTN[DominicanHTN[,'gender'] == "Male",])
[1] 123
which(DominicanHTN[,'village'] %in% c('Carmona','San Antonio'))
[1] 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
[32] 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174
[63] 175 176 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277
[94] 278 279 280 281 282 283
cojMen <- DominicanHTN[DominicanHTN[,'village'] == 'Cojobal' & DominicanHTN[,'gender'] == "Male",]
cojMen
cojMen[order(cojMen[,'age'], decreasing=TRUE),]
table(DominicanHTN[,'gender'])
Female Male
258 123
addmargins(with(DominicanHTN, table(village, gender)))
gender
village Female Male Sum
Bare' Nuevo 19 13 32
Batey Verde 30 11 41
Carmona 39 25 64
Cojobal 47 12 59
Juan Sanchez 40 13 53
La Altagracia 22 18 40
Los Gueneos 41 16 57
San Antonio 20 15 35
Sum 258 123 381
with(DominicanHTN, tapply(age, village, mean))
Bare' Nuevo Batey Verde Carmona Cojobal Juan Sanchez La Altagracia Los Gueneos San Antonio
47.53125 48.02439 49.00000 47.15254 47.45283 49.65000 45.15789 51.25714
aggregate(sbp ~ gender, data=DominicanHTN, FUN=mean)
aggregate(age ~ village + gender, DominicanHTN, median)
The lead exposure dataset was collected to study the well-being of childen who lived near a lead smelting plant. The following analysis considers the lead exposure levels in 1972 and 1973, as well as age and the finger-wrist tapping score maxfwt.
getHdata(lead)
lead <- upData(lead,
keep = c('ld72','ld73','age','maxfwt'),
labels = c(age = 'Age'),
units = c(age='years', ld72='mg/100*ml', ld73='mg/100*ml'))
Input object size: 50832 bytes; 39 variables 124 observations Kept variables ld72,ld73,age,maxfwt New object size: 12824 bytes; 4 variables 124 observations
html(contents(lead), maxlevels=10, levelType='table')
| Name | Labels | Units | Storage | NAs |
|---|---|---|---|---|
| age | Age | years | double | 0 |
| ld72 | Blood Lead Levels, 1972 | mg/100*ml | integer | 0 |
| ld73 | Blood Lead Levels, 1973 | mg/100*ml | integer | 0 |
| maxfwt | Maximum mean finger-wrist tapping score | integer | 25 |
html(describe(lead))
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 124 | 0 | 73 | 1 | 8.935 | 4.074 | 3.929 | 4.333 | 6.167 | 8.375 | 12.021 | 14.000 | 15.000 |
| lowest : | 3.750000 | 3.833333 | 3.916667 | 4.000000 | 4.166667 |
| highest: | 14.250000 | 15.000000 | 15.250000 | 15.416667 | 15.916667 |
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 124 | 0 | 47 | 0.999 | 36.16 | 17.23 | 18.00 | 21.00 | 27.00 | 34.00 | 43.00 | 57.00 | 61.85 |
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 124 | 0 | 37 | 0.998 | 31.71 | 11.06 | 18.15 | 21.00 | 24.00 | 30.50 | 37.00 | 47.00 | 50.85 |
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 99 | 25 | 40 | 0.998 | 51.96 | 13.8 | 33.2 | 38.0 | 46.0 | 52.0 | 59.0 | 65.0 | 72.2 |
The rms package includes the ols fitting function. Use datadist to compute summaries of the distributional charateristics of the predictors - or simply give it an entire data frame. datadist must be re-run if you add a new predictor or recode an old one.
require(rms)
Loading required package: rms
Loading required package: SparseM
Attaching package: ‘SparseM’
The following object is masked from ‘package:base’:
backsolve
dd <- datadist(lead)
options(datadist='dd')
f <- ols(maxfwt ~ age + ld72 + ld73, data=lead)
Calling 'structure(NULL, *)' is deprecated, as NULL cannot have attributes.
Consider 'structure(list(), *)' instead.
f
Frequencies of Missing Values Due to Each Variable
maxfwt age ld72 ld73
25 0 0 0
Linear Regression Model
ols(formula = maxfwt ~ age + ld72 + ld73, data = lead)
Model Likelihood Discrimination
Ratio Test Indexes
Obs 99 LR chi2 62.25 R2 0.467
sigma9.5221 d.f. 3 R2 adj 0.450
d.f. 95 Pr(> chi2) 0.0000 g 10.104
Residuals
Min 1Q Median 3Q Max
-33.9958 -4.9214 0.7596 5.1106 33.2590
Coef S.E. t Pr(>|t|)
Intercept 34.1059 4.8438 7.04 <0.0001
age 2.6078 0.3231 8.07 <0.0001
ld72 -0.0246 0.0782 -0.31 0.7538
ld73 -0.2390 0.1325 -1.80 0.0744
rms provides many methods to work with the ols fit object.
coef(f)
Intercept age ld72 ld73
34.1058551 2.6078450 -0.0245978 -0.2389695
The default values for function arguments are medians.
g <- Function(f)
g
function(age = 8.375,ld72 = 34,ld73 = 30.5) {34.105855+2.607845*age-0.024597799*ld72-0.23896951*ld73 }
<environment: 0x78ca898>
g(age=10, ld72=21, ld73=c(21, 47))
[1] 54.64939 48.43618
predict(f, data.frame(age=10, ld72=21, ld73=c(21, 47)), se.fit=TRUE)
$linear.predictors
1 2
54.64939 48.43618
$se.fit
1 2
1.391858 3.140361
r <- resid(f)
par(mfrow=c(2,2))
plot(fitted(f), r); abline(h=0)
with(lead, plot(age, r)); abline(h=0)
with(lead, plot(ld73, r)); abline(h=0)
# q-q plot to check normality
qqnorm(r)
Predict and pplot make one plot for each predictor. Predictors not shown in plot are set to constants (continuous:median, categorical:mode). 0.95 pointwise confidence limits for \(\hat{E}(y|x)\) are shown; use conf.int=FALSE to suppress CLs.
plotp(Predict(f))
Specify which predictors are plotted.
plotp(Predict(f, age))
plotp(Predict(f, age=3:15))
plotp(Predict(f, age=seq(3,16,length=150)))
Obtain confidence limits for \(\hat{y}\).
plotp(Predict(f, age, conf.type='individual'))
Combine plots.
p1 <- Predict(f, age, conf.int=0.99, conf.type='individual')
p2 <- Predict(f, age, conf.int=0.99, conf.type='mean')
p <- rbind(Individual=p1, Mean=p2)
ggplot(p)
plotp(Predict(f, ld73, age=3))
ggplot(Predict(f, ld73, age=c(3,9)))
3-d surface for two continuous predictors against \(\hat{y}\).
bplot(Predict(f, ld72, ld73))
plot(nomogram(f))
summary(f)
Effects Response : maxfwt
Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
age 6.1667 12.021 5.8542 15.26700 1.8914 11.5120 19.02200
ld72 27.0000 43.000 16.0000 -0.39356 1.2511 -2.8773 2.09010
ld73 24.0000 37.000 13.0000 -3.10660 1.7222 -6.5255 0.31234
Adjust age to 5, which has no effect as the model does not include any interactions.
summary(f, age=5)
Effects Response : maxfwt
Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
age 6.1667 12.021 5.8542 15.26700 1.8914 11.5120 19.02200
ld72 27.0000 43.000 16.0000 -0.39356 1.2511 -2.8773 2.09010
ld73 24.0000 37.000 13.0000 -3.10660 1.7222 -6.5255 0.31234
Effect of changing ld73 from 20 to 40.
summary(f, ld73=c(20,40))
Effects Response : maxfwt
Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
age 6.1667 12.021 5.8542 15.26700 1.8914 11.5120 19.02200
ld72 27.0000 43.000 16.0000 -0.39356 1.2511 -2.8773 2.09010
ld73 20.0000 40.000 20.0000 -4.77940 2.6495 -10.0390 0.48052
If a predictor has a linear effect, a one-unit change can be used to get the confidence interval of its slope.
summary(f, age=5:6)
Effects Response : maxfwt
Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
age 5 6 1 2.60780 0.32308 1.9664 3.24920
ld72 27 43 16 -0.39356 1.25110 -2.8773 2.09010
ld73 24 37 13 -3.10660 1.72220 -6.5255 0.31234
There is also a plot method for summary results.
plot(summary(f))
Give predict a fit object and data.frame.
predict(f, data.frame(age=3, ld72=21, ld73=21))
1
36.39448
predict(f, data.frame(age=c(3,10), ld72=21, ld73=c(21,47)))
1 2
36.39448 48.43618
newdat <- expand.grid(age=c(4,8), ld72=c(21, 47), ld73=c(21, 47))
newdat
predict(f, newdat)
1 2 3 4 5 6 7 8
39.00232 49.43370 38.36278 48.79416 32.78911 43.22049 32.14957 42.58095
Include confidence levels.
predict(f, newdat, conf.int=0.95)
$linear.predictors
1 2 3 4 5 6 7 8
39.00232 49.43370 38.36278 48.79416 32.78911 43.22049 32.14957 42.58095
$lower
1 2 3 4 5 6 7 8
33.97441 46.23595 32.15468 43.94736 25.68920 36.94167 27.17060 38.86475
$upper
1 2 3 4 5 6 7 8
44.03023 52.63145 44.57088 53.64095 39.88902 49.49932 37.12854 46.29716
predict(f, newdat, conf.int=0.95, conf.type='individual')
$linear.predictors
1 2 3 4 5 6 7 8
39.00232 49.43370 38.36278 48.79416 32.78911 43.22049 32.14957 42.58095
$lower
1 2 3 4 5 6 7 8
19.44127 30.26132 18.46566 29.27888 12.59596 23.30120 12.60105 23.31531
$upper
1 2 3 4 5 6 7 8
58.56337 68.60609 58.25989 68.30944 52.98227 63.13979 51.69810 61.84659
Brute-force predicted values.
b <- coef(f)
b[1] + b[2]*3 + b[3]*21 + b[4]*21
Intercept
36.39448
Predicted values with Function.
# g <- Function(f)
g(age = c(3,8), ld72 = 21, ld73 = 21)
[1] 36.39448 49.43370
g(age = 3)
[1] 33.80449
Use anova to get all total effects and individual partial effects.
an <- anova(f)
an
Analysis of Variance Response: maxfwt
Factor d.f. Partial SS MS F P
age 1 5907.535742 5907.535742 65.15 <.0001
ld72 1 8.972994 8.972994 0.10 0.7538
ld73 1 295.044370 295.044370 3.25 0.0744
REGRESSION 3 7540.087710 2513.362570 27.72 <.0001
ERROR 95 8613.750674 90.671060
Include corresponding variable names.
print(an, 'names')
Analysis of Variance Response: maxfwt
Factor d.f. Partial SS MS F P Tested
age 1 5907.535742 5907.535742 65.15 <.0001 age
ld72 1 8.972994 8.972994 0.10 0.7538 ld72
ld73 1 295.044370 295.044370 3.25 0.0744 ld73
REGRESSION 3 7540.087710 2513.362570 27.72 <.0001 age,ld72,ld73
ERROR 95 8613.750674 90.671060
print(an, 'subscripts')
Analysis of Variance Response: maxfwt
Factor d.f. Partial SS MS F P Tested
age 1 5907.535742 5907.535742 65.15 <.0001 1
ld72 1 8.972994 8.972994 0.10 0.7538 2
ld73 1 295.044370 295.044370 3.25 0.0744 3
REGRESSION 3 7540.087710 2513.362570 27.72 <.0001 1-3
ERROR 95 8613.750674 90.671060
Subscripts correspond to:
[1] age ld72 ld73
print(an, 'dots')
Analysis of Variance Response: maxfwt
Factor d.f. Partial SS MS F P Tested
age 1 5907.535742 5907.535742 65.15 <.0001 .
ld72 1 8.972994 8.972994 0.10 0.7538 .
ld73 1 295.044370 295.044370 3.25 0.0744 .
REGRESSION 3 7540.087710 2513.362570 27.72 <.0001 ...
ERROR 95 8613.750674 90.671060
Subscripts correspond to:
[1] age ld72 ld73
Combined partial effects.
anova(f, ld72, ld73)
Analysis of Variance Response: maxfwt
Factor d.f. Partial SS MS F P
ld72 1 8.972994 8.972994 0.10 0.7538
ld73 1 295.044370 295.044370 3.25 0.0744
REGRESSION 2 747.283558 373.641779 4.12 0.0192
ERROR 95 8613.750674 90.671060
Compressed data and loads faster.
feather packageAllows data queries.
RODBC packageRSQLite packageHmisc package: sas.get, sasxport.gethaven package: read_sas, read_sav, read_dtaforeign packageAn R data.frame consists of a collection of values. In a well-structured data set, each value is associated with a variable (column) and observation (row). Data sets often need to be manipulated to become well-structured.
Hadley Wickham’s Tidy Data provides an excellent overview on how to clean messy data sets.
# politics and religion
senateReligion <- data.frame(religion=c('Protestant','Catholic','Jewish','Mormon','Buddhist',"Don't Know/Refused"),
Democrats=c(20,15,8,1,1,3),
Republicans=c(38,9,0,5,0,0))
senRel <- cbind(senateReligion[,'religion'], stack(senateReligion[,c('Democrats','Republicans')]))
names(senRel) <- c('religion','count','party')
senRel
pets <- data.frame(county=c('Davidson','Rutherford','Cannon'),
male.dog=c(50,150,70), female.dog=c(60,150,70),
male.cat=c(30,100,50), female.cat=c(30,70,40),
male.horse=c(6,30,20), female.horse=c(6,28,19))
pets2 <- cbind(pets[,'county'], stack(pets[,-1]))
names(pets2) <- c('county','count','ind')
Watch out for factor variables.
tryCatch(strsplit(pets2[,'ind'], '\\.'), error=function(e){ e })
<simpleError in strsplit(pets2[, "ind"], "\\."): non-character argument>
genderAnimal <- strsplit(as.character(pets2[,'ind']), '\\.')
pets2[,c('gender','animal')] <- NA
for(i in seq(nrow(pets2))) {
pets2[i, c('gender','animal')] <- genderAnimal[[i]]
}
pets2[,'ind'] <- NULL
pets2
set.seed(1000)
d1 <- rnorm(1000, 5, 2)
d2 <- runif(1000, 0, 10)
d3 <- rpois(1000, 5)
d4 <- rbinom(1000, 10, 0.5)
x <- data.frame(distribution=c(rep('normal',2),rep('uniform',2),rep('poisson',2),rep('binomial',2)),
stat=rep(c('mean','sd'),4),
value=c(mean(d1), sd(d1), mean(d2), sd(d2), mean(d3), sd(d3), mean(d4), sd(d4)))
cbind(distribution=x[c(1,3,5,7),'distribution'], unstack(x, form=value~stat))
employees <- data.frame(id=1:4, Name=c('Eddie','Andrea','Steve','Theresa'),
job=c('engineer','accountant','statistician','technician'))
set.seed(10)
hours <- data.frame(id=sample(4, 10, replace=TRUE),
week=1:10, hours=sample(30:60, 10, replace=TRUE))
merge(employees, hours)
# merge(employees, hours, by.x='id', by.y='id')
empHours <- merge(employees, hours, all.x=TRUE)
empHours
unique(empHours[,c('id','Name','job')])
empHours[,c('id','week','hours')]
Exclude rows with missing data.
empHours[!is.na(empHours[,'week']), c('id','week','hours')]
rbind(genderAnimal[[1]], genderAnimal[[2]], genderAnimal[[3]])
[,1] [,2]
[1,] "male" "dog"
[2,] "male" "dog"
[3,] "male" "dog"
do.call(rbind, genderAnimal)
[,1] [,2]
[1,] "male" "dog"
[2,] "male" "dog"
[3,] "male" "dog"
[4,] "female" "dog"
[5,] "female" "dog"
[6,] "female" "dog"
[7,] "male" "cat"
[8,] "male" "cat"
[9,] "male" "cat"
[10,] "female" "cat"
[11,] "female" "cat"
[12,] "female" "cat"
[13,] "male" "horse"
[14,] "male" "horse"
[15,] "male" "horse"
[16,] "female" "horse"
[17,] "female" "horse"
[18,] "female" "horse"
Plot mean while sampling from normal distribution
n <- 100
imean <- numeric(n)
smean <- numeric(n)
for(i in seq(n)) {
imean[i] <- mean(rnorm(10))
smean[i] <- mean(imean[seq(i)])
}
plot(imean, ylab='Sample Mean')
abline(h=0, lty=3)
lines(smean)
n <- 1000
sig <- numeric(n)
for(i in seq(n)) {
grp1 <- rnorm(15, 60, 5)
grp2 <- rnorm(15, 65, 5)
sig[i] <- t.test(grp1, grp2, var.equal = TRUE)$p.value < 0.05
}
mean(sig)
[1] 0.723
Add some flexibility to our simulation by creating a function.
tTestPower <- function(n=15, mu1=60, mu2=65, sd=5, nsim=1000) {
sig <- numeric(nsim)
for(i in seq(nsim)) {
grp1 <- rnorm(n, mu1, sd)
grp2 <- rnorm(n, mu2, sd)
sig[i] <- t.test(grp1, grp2, var.equal = TRUE)$p.value < 0.05
}
mean(sig)
}
tTestPower()
[1] 0.758
tTestPower(25, nsim=10000)
[1] 0.9306
There’s already a function to compute the power of a two-sample t test.
power.t.test(n=25, delta=5, sd=5)$power
[1] 0.9337076
true_mu <- 0
x <- rnorm(100, true_mu)
R <- 999
res <- matrix(nrow=R, ncol=6)
colnames(res) <- c('mu','se','lb','ub','coverage','bias')
for(i in seq(R)) {
r <- sample(x, replace=TRUE)
res[i,'mu'] <- mean(r)
res[i,'se'] <- sd(r) / sqrt(length(r))
}
res[,'lb'] <- res[,'mu'] + qnorm(0.025) * res[,'se']
res[,'ub'] <- res[,'mu'] + qnorm(0.975) * res[,'se']
res[,'coverage'] <- res[,'lb'] < true_mu & res[,'ub'] > true_mu
res[,'bias'] <- res[,'mu'] - true_mu
res[1:5,]
mu se lb ub coverage bias
[1,] 0.004925515 0.10318672 -0.19731674 0.2071678 1 0.004925515
[2,] 0.102092092 0.08152406 -0.05769213 0.2618763 1 0.102092092
[3,] -0.015542538 0.08834944 -0.18870427 0.1576192 1 -0.015542538
[4,] -0.040089989 0.10071664 -0.23749097 0.1573110 1 -0.040089989
[5,] -0.047656957 0.10017721 -0.24400068 0.1486868 1 -0.047656957
vals <- colMeans(res)
out <- sprintf("empirical SE: %.3f
MSE: %.3f
basic bootstrap confidence interval: (%.3f, %.3f)
coverage probability: %.3f
mean bias: %.3f
sample mean: %.3f
", sd(res[,'mu']), vals['se'], vals['lb'], vals['ub'], vals['coverage'], vals['bias'], mean(x))
cat(out)
empirical SE: 0.100
MSE: 0.099
basic bootstrap confidence interval: (-0.216, 0.171)
coverage probability: 0.933
mean bias: -0.023
sample mean: -0.021
Wickham’s “Tidy Data” philosophy was concurrent with the development of a suite of R packages useful for data management. This includes ggplot2 and haven as well as many others such as dplyr, stringr, and tidyr.
data.table is another beneficial package for data manipulation. It provides similar functionality to the data.frame, but works well with large data sets.